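# Packages used in this analysis
requirements <- c("tidyverse", "mice", "caTools", "corrplot", "summarytools", "plotly", "readr", "caret")
# Install any missing package, then attach it (require() alone does not load a package it just failed to find)
for (req in requirements) {
  if (!require(req, character.only = TRUE)) {
    install.packages(req)
    library(req, character.only = TRUE)
  }
}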
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: mice
##
##
## Attaching package: 'mice'
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
##
##
## Loading required package: caTools
##
## Loading required package: corrplot
##
## corrplot 0.94 loaded
##
## Loading required package: summarytools
##
##
## Attaching package: 'summarytools'
##
##
## The following object is masked from 'package:tibble':
##
## view
##
##
## Loading required package: plotly
##
##
## Attaching package: 'plotly'
##
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following object is masked from 'package:graphics':
##
## layout
##
##
## Loading required package: caret
##
## Loading required package: lattice
##
##
## Attaching package: 'caret'
##
##
## The following object is masked from 'package:purrr':
##
## lift
The objective of this project is to analyze match statistics from the Spanish La Liga over the 14 seasons from 2009/2010 to 2022/2023 and to predict results for the 2023/2024 season. The dataset, sourced from http://www.football-data.co.uk/, records many aspects of each match, including final and half-time results, corner kicks, and disciplinary actions such as yellow and red cards. It is a valuable resource for understanding the dynamics of matches in one of Europe’s top football leagues.
Each record describes one match: the date, the two teams involved, the final and half-time scores, shots, fouls, corner kicks, and yellow and red cards.
The fields recorded for each match are described in the following table:
| Label | Description |
|---|---|
| Date | Date of the match |
| HomeTeam | Home Team of the match |
| AwayTeam | Away Team of the match |
| FTHG | Full Time Home Team Goals |
| FTAG | Full Time Away Team Goals |
| FTR | Full Time Result (H=Home Win, D=Draw, A=Away Win) |
| HTHG | Half Time Home Team Goals |
| HTAG | Half Time Away Team Goals |
| HTR | Half Time Result (H=Home Win, D=Draw, A=Away Win) |
| HS | Home Team Shots |
| AS | Away Team Shots |
| HST | Home Team Shots on Target |
| AST | Away Team Shots on Target |
| HF | Home Team Fouls Committed |
| AF | Away Team Fouls Committed |
| HC | Home Team Corners |
| AC | Away Team Corners |
| HY | Home Team Yellow Cards |
| AY | Away Team Yellow Cards |
| HR | Home Team Red Cards |
| AR | Away Team Red Cards |
The CSV file downloaded from the website covers every season of the Spanish La Liga from 2009/2010 through 2022/2023. Each season’s data contains the match statistics listed above, including final and half-time scores, team information, and disciplinary actions, so match outcomes and related metrics can be compared across seasons.
I filtered out the betting-related statistics and the qualitative variables not listed above, retaining only the essential match statistics for subsequent analysis.
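As a rough illustration of that filtering step, the sketch below keeps only the columns listed in the table above. The `raw_season` data frame, its extra columns (`Div`, `B365H`), and the helper `keep_match_columns()` are hypothetical stand-ins for a raw download; they are not taken from the original analysis.

```r
# Columns retained for the analysis (taken from the table above)
keep_cols <- c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR",
               "HTHG", "HTAG", "HTR", "HS", "AS", "HST", "AST",
               "HF", "AF", "HC", "AC", "HY", "AY", "HR", "AR")

# Hypothetical helper: drop every column that is not in `keep`
keep_match_columns <- function(df, keep = keep_cols) {
  df[, intersect(keep, names(df)), drop = FALSE]
}

# Toy stand-in for one raw row; the extra columns and their values are illustrative only
raw_season <- data.frame(
  Div = "SP1", Date = "29/08/09", HomeTeam = "Real Madrid", AwayTeam = "La Coruna",
  FTHG = 3, FTAG = 2, FTR = "H", B365H = 1.2,
  stringsAsFactors = FALSE
)

filtered <- keep_match_columns(raw_season)
names(filtered)
```

Selecting by name rather than by position keeps the step robust if the column order differs between season files.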
# Read the dataset from the CSV file
football_data <- read_csv("./dataset.csv")
## Rows: 5320 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Date, HomeTeam, AwayTeam, FTR, HTR
## dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football_data)
## # A tibble: 6 × 21
## Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR HS AS HST
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 29/08… Real Ma… La Coru… 3 2 H 2 1 H 28 9 11
## 2 29/08… Zaragoza Tenerife 1 0 H 0 0 D 17 16 8
## 3 30/08… Almeria Vallado… 0 0 D 0 0 D 20 7 5
## 4 30/08… Ath Bil… Espanol 1 0 H 0 0 D 14 8 4
## 5 30/08… Malaga Ath Mad… 3 0 H 1 0 H 8 16 4
## 6 30/08… Mallorca Xerez 2 0 H 0 0 D 10 7 3
## # ℹ 9 more variables: AST <dbl>, HF <dbl>, AF <dbl>, HC <dbl>, AC <dbl>,
## # HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>
To ensure the integrity of our analysis, we need to clean the data by checking for missing values, duplicate entries, and inconsistencies in data types.
# Check for missing values
missing_values <- colSums(is.na(football_data))
missing_values[missing_values > 0]
## named numeric(0)
# Convert necessary columns to appropriate data types
football_data$FTR <- factor(football_data$FTR, levels = c("H", "D", "A"), labels = c("Home Win", "Draw", "Away Win"))
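The duplicate and data-type checks mentioned above could look roughly like the following sketch. It runs on a toy data frame rather than on `football_data`, and it assumes that the `Date` column might mix two-digit and four-digit years; the preview only shows the `dd/mm/yy` form, so that mix is an assumption. `parse_match_date()` is a hypothetical helper.

```r
# Toy rows standing in for football_data (the third row repeats the first)
matches <- data.frame(
  Date     = c("29/08/09", "30/08/2019", "29/08/09"),
  HomeTeam = c("Team A", "Team C", "Team A"),
  AwayTeam = c("Team B", "Team D", "Team B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse dd/mm/yy and dd/mm/yyyy strings into Date values
parse_match_date <- function(x) {
  four_digit <- grepl("/\\d{4}$", x)
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%d/%m/%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%d/%m/%y")
  out
}

# A match is treated as a duplicate if date and both teams repeat
dup <- duplicated(matches[, c("Date", "HomeTeam", "AwayTeam")])
matches_clean <- matches[!dup, ]

# Convert the character dates to Date objects
matches_clean$Date <- parse_match_date(matches_clean$Date)
matches_clean
```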
# Summary of the cleaned dataset
summary(football_data)
## Date HomeTeam AwayTeam FTHG
## Length:5320 Length:5320 Length:5320 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Mode :character Median : 1.000
## Mean : 1.552
## 3rd Qu.: 2.000
## Max. :10.000
## FTAG FTR HTHG HTAG
## Min. :0.000 Home Win:2508 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 Draw :1320 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.000 Away Win:1492 Median :0.0000 Median :0.0000
## Mean :1.124 Mean :0.6882 Mean :0.4902
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :8.000 Max. :6.0000 Max. :5.0000
## HTR HS AS HST
## Length:5320 Min. : 1.00 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:10.00 1st Qu.: 8.00 1st Qu.: 3.00
## Mode :character Median :13.00 Median :10.00 Median : 4.00
## Mean :13.61 Mean :10.75 Mean : 4.86
## 3rd Qu.:17.00 3rd Qu.:14.00 3rd Qu.: 6.00
## Max. :37.00 Max. :39.00 Max. :18.00
## AST HF AF HC
## Min. : 0.000 Min. : 1.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.:11.00 1st Qu.:11.00 1st Qu.: 4.000
## Median : 3.000 Median :14.00 Median :14.00 Median : 5.000
## Mean : 3.778 Mean :14.06 Mean :13.88 Mean : 5.704
## 3rd Qu.: 5.000 3rd Qu.:17.00 3rd Qu.:17.00 3rd Qu.: 7.000
## Max. :16.000 Max. :33.00 Max. :31.00 Max. :20.000
## AC HY AY HR
## Min. : 0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:0.0000
## Median : 4.000 Median :2.000 Median :3.000 Median :0.0000
## Mean : 4.381 Mean :2.433 Mean :2.671 Mean :0.1241
## 3rd Qu.: 6.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:0.0000
## Max. :17.000 Max. :9.000 Max. :9.000 Max. :3.0000
## AR
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1547
## 3rd Qu.:0.0000
## Max. :3.0000
Distribution of Match Outcomes
p1 <- ggplot(football_data, aes(x = FTR)) +
  geom_bar(fill = "lightblue", color = "black") +
  labs(title = "Distribution of Match Results", x = "Match Outcome", y = "Count") +
  theme_minimal() +
  # `linewidth` replaces the `size` argument deprecated in ggplot2 3.4.0
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p1_interactive <- ggplotly(p1)
p1_interactive
Goals Scored Distribution
# Home Team Goals
p2 <- ggplot(football_data, aes(x = FTHG)) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(text = after_stat(count)), bins = 10, fill = "green", alpha = 0.7, color = "black") +
  labs(title = "Distribution of Home Team Goals", x = "Goals", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p2_interactive <- ggplotly(p2, tooltip = "text")
p2_interactive
# Away Team Goals
p3 <- ggplot(football_data, aes(x = FTAG)) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(text = after_stat(count)), bins = 10, fill = "red", alpha = 0.7, color = "black") +
  labs(title = "Distribution of Away Team Goals", x = "Goals", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
p3_interactive <- ggplotly(p3, tooltip = "text")
p3_interactive
Home Advantage
# Analyze home advantage
p_home_advantage <- ggplot(football_data, aes(x = FTR, fill = FTR)) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_bar(aes(text = after_stat(count)), position = "dodge", color = "black") +
  labs(title = "Home Advantage in Match Outcomes", x = "Match Result", y = "Count") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
ggplotly(p_home_advantage, tooltip = "text")
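To put a number on the home advantage, the counts reported by `summary(football_data)` earlier can be turned into percentages. This is a small worked calculation on those three counts only; the variable names are illustrative.

```r
# Full-time result counts as reported by summary(football_data)
ftr_counts <- c("Home Win" = 2508, "Draw" = 1320, "Away Win" = 1492)

# Share of each outcome, in percent of all 5320 matches
ftr_share <- round(100 * ftr_counts / sum(ftr_counts), 1)
ftr_share
```

Home wins account for roughly 47% of matches, against about 28% for away wins and 25% for draws.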
Goals vs. Match Outcome
# Home Goals vs. Match Outcome
p_home_goals <- ggplot(football_data, aes(x = FTR, y = FTHG)) +
  geom_boxplot(aes(text = paste("Home Goals: ", FTHG)), fill = "lightblue", color = "black") +
  labs(title = "Home Goals vs. Match Outcome", x = "Match Outcome", y = "Home Goals") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
## Warning in geom_boxplot(aes(text = paste("Home Goals: ", FTHG)), fill =
## "lightblue", : Ignoring unknown aesthetics: text
ggplotly(p_home_goals, tooltip = "text")
# Away Goals vs. Match Outcome
p_away_goals <- ggplot(football_data, aes(x = FTR, y = FTAG)) +
  geom_boxplot(aes(text = paste("Away Goals: ", FTAG)), fill = "lightgreen", color = "black") +
  labs(title = "Away Goals vs. Match Outcome", x = "Match Outcome", y = "Away Goals") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
## Warning in geom_boxplot(aes(text = paste("Away Goals: ", FTAG)), fill =
## "lightgreen", : Ignoring unknown aesthetics: text
ggplotly(p_away_goals, tooltip = "text")
Shots and Match Outcome
We will analyze the relationship between shots and match results by visualizing the number of home and away shots for each match outcome.
# Home Team Shots vs. Match Outcome
home_shots_plot <- ggplot(football_data, aes(x = FTR, y = HS)) +
  geom_boxplot(aes(fill = FTR)) +
  labs(title = "Home Team Shots vs. Match Outcome", x = "Full Time Result", y = "Home Team Shots")
ggplotly(home_shots_plot)
# Away Team Shots vs. Match Outcome
away_shots_plot <- ggplot(football_data, aes(x = FTR, y = AS)) +
  geom_boxplot(aes(fill = FTR)) +
  labs(title = "Away Team Shots vs. Match Outcome", x = "Full Time Result", y = "Away Team Shots")
ggplotly(away_shots_plot)
numeric_columns <- football_data[, c("FTHG", "FTAG", "HS", "AS", "HST", "AST", "HF", "AF", "HC", "AC", "HY", "AY", "HR", "AR")]
# correlation matrix
cor_matrix <- cor(numeric_columns, use = "complete.obs")
# Correlation Heatmap
heatmap_plot <- plot_ly(
  z = cor_matrix,
  x = colnames(cor_matrix),
  y = colnames(cor_matrix),
  type = "heatmap",
  colors = colorRamp(c("blue", "white", "red")),
  # Fix the color range to [-1, 1] so that white corresponds to zero correlation
  zmin = -1,
  zmax = 1,
  colorbar = list(title = "Correlation")
) %>% layout(
  title = "Correlation Heatmap of Key Match Variables",
  xaxis = list(tickangle = 45),
  yaxis = list(autorange = "reversed")
)
heatmap_plot
A list of the teams in the dataset is useful for the steps that follow. It is extracted with the unique function:
# Pull the column as a plain character vector; indexing a tibble with [, "HomeTeam"] would return a tibble instead
teams <- unique(football_data$HomeTeam)
cat(teams, sep = "\n")
## Real Madrid
## Zaragoza
## Almeria
## Ath Bilbao
## Malaga
## Mallorca
## Osasuna
## Santander
## Valencia
## Barcelona
## Ath Madrid
## Espanol
## Getafe
## Sevilla
## La Coruna
## Sp Gijon
## Tenerife
## Valladolid
## Villarreal
## Xerez
## Hercules
## Levante
## Sociedad
## Granada
## Betis
## Vallecano
## Celta
## Elche
## Eibar
## Cordoba
## Las Palmas
## Leganes
## Alaves
## Girona
## Huesca
## Cadiz
To begin the analysis, I have decided to start with only one team to simplify the operations. In this case, I have selected FC Barcelona as my team to analyze. The information is split into two different dataframes: one for the matches played as the Home Team and the other for the matches played as the Away Team.
# Filter Barcelona's matches from the dataset
barcelona_matches <- football_data %>%
  filter(HomeTeam == "Barcelona" | AwayTeam == "Barcelona")
# Separate matches by home and away games
barca_home <- barcelona_matches %>%
  filter(HomeTeam == "Barcelona")
barca_away <- barcelona_matches %>%
  filter(AwayTeam == "Barcelona")
Aggregate statistics are calculated for matches where Barcelona played at home and away. This includes total fouls, red/yellow cards, shots, and shots on target.
# for home matches
barca_home_summary <- barca_home %>%
  summarize(
    TotalFouls = sum(HF),
    TotalRedCards = sum(HR),
    TotalYellowCards = sum(HY),
    TotalShots = sum(HS),
    TotalShotsOnTarget = sum(HST)
  )
# for away matches
barca_away_summary <- barca_away %>%
  summarize(
    TotalFouls = sum(AF),
    TotalRedCards = sum(AR),
    TotalYellowCards = sum(AY),
    TotalShots = sum(AS),
    TotalShotsOnTarget = sum(AST)
  )
print(barca_home_summary)
## # A tibble: 1 × 5
## TotalFouls TotalRedCards TotalYellowCards TotalShots TotalShotsOnTarget
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2938 20 438 4473 1947
print(barca_away_summary)
## # A tibble: 1 × 5
## TotalFouls TotalRedCards TotalYellowCards TotalShots TotalShotsOnTarget
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2876 23 595 3689 1558
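The totals above can also be read as ratios. As a small worked example, the sketch below derives Barcelona's shot accuracy (shots on target divided by shots) from the printed totals; the `barca_totals` data frame and the `ShotAccuracy` column are names introduced here for illustration.

```r
# Totals copied from the two summaries printed above
barca_totals <- data.frame(
  MatchType          = c("Home", "Away"),
  TotalShots         = c(4473, 3689),
  TotalShotsOnTarget = c(1947, 1558)
)

# Share of shots that were on target
barca_totals$ShotAccuracy <- barca_totals$TotalShotsOnTarget / barca_totals$TotalShots
barca_totals
```

Accuracy comes out at about 43.5% at home and 42.2% away, so the main home/away difference lies in the volume of shots rather than in their precision.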
I combined the home and away summaries for easier comparison. Then, the data is reshaped into a long format suitable for plotting, with each statistic represented separately.
# Combine summaries and add MatchType information
combined_summary <- bind_rows(
  mutate(barca_home_summary, MatchType = "Home"),
  mutate(barca_away_summary, MatchType = "Away")
)
# Reshape for plotting
combined_summary_long <- pivot_longer(
  combined_summary,
  cols = c(TotalFouls, TotalRedCards, TotalYellowCards, TotalShots, TotalShotsOnTarget),
  names_to = "Statistic",
  values_to = "Count"
)
A bar plot is created with custom colors and labels for each statistic. This shows the distribution of fouls, red/yellow cards, and shots across home and away matches.
# Define custom color palette for the plot
my_colors <- c("TotalFouls" = "#1f77b4",
               "TotalRedCards" = "red",
               "TotalYellowCards" = "yellow",
               "TotalShots" = "#2ca02c",
               "TotalShotsOnTarget" = "violet")
bar_plot <- ggplot(combined_summary_long, aes(x = MatchType, y = Count, fill = Statistic, label = Count)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5, size = 3,
            aes(group = Statistic), color = "black", fontface = "bold", show.legend = FALSE) +
  labs(title = "Summary of Barcelona Matches",
       y = "Count", x = "Match Type", fill = "Statistic") +
  scale_fill_manual(values = my_colors) +
  theme_minimal() +
  theme(legend.position = "top",
        axis.title.x = element_text(size = 12, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
interactive_plot <- ggplotly(bar_plot)
interactive_plot
# Display the structure of football_data
str(football_data)
## spc_tbl_ [5,320 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date : chr [1:5320] "29/08/09" "29/08/09" "30/08/09" "30/08/09" ...
## $ HomeTeam: chr [1:5320] "Real Madrid" "Zaragoza" "Almeria" "Ath Bilbao" ...
## $ AwayTeam: chr [1:5320] "La Coruna" "Tenerife" "Valladolid" "Espanol" ...
## $ FTHG : num [1:5320] 3 1 0 1 3 2 1 1 2 3 ...
## $ FTAG : num [1:5320] 2 0 0 0 0 0 1 4 0 0 ...
## $ FTR : Factor w/ 3 levels "Home Win","Draw",..: 1 1 2 1 1 1 2 3 1 1 ...
## $ HTHG : num [1:5320] 2 0 0 0 1 0 1 1 0 2 ...
## $ HTAG : num [1:5320] 1 0 0 0 0 0 1 3 0 0 ...
## $ HTR : chr [1:5320] "H" "D" "D" "D" ...
## $ HS : num [1:5320] 28 17 20 14 8 10 7 4 8 20 ...
## $ AS : num [1:5320] 9 16 7 8 16 7 11 9 3 9 ...
## $ HST : num [1:5320] 11 8 5 4 4 3 2 3 6 9 ...
## $ AST : num [1:5320] 3 2 1 1 3 3 7 6 1 5 ...
## $ HF : num [1:5320] 18 16 9 11 16 14 18 14 20 10 ...
## $ AF : num [1:5320] 12 17 11 18 8 13 14 10 14 12 ...
## $ HC : num [1:5320] 10 7 12 6 4 6 4 4 7 9 ...
## $ AC : num [1:5320] 3 8 2 3 5 6 14 5 0 7 ...
## $ HY : num [1:5320] 2 1 2 2 4 3 2 2 1 0 ...
## $ AY : num [1:5320] 2 4 2 6 4 1 2 3 2 2 ...
## $ HR : num [1:5320] 0 0 0 0 0 0 0 0 0 0 ...
## $ AR : num [1:5320] 0 0 1 0 0 2 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_character(),
## .. HomeTeam = col_character(),
## .. AwayTeam = col_character(),
## .. FTHG = col_double(),
## .. FTAG = col_double(),
## .. FTR = col_character(),
## .. HTHG = col_double(),
## .. HTAG = col_double(),
## .. HTR = col_character(),
## .. HS = col_double(),
## .. AS = col_double(),
## .. HST = col_double(),
## .. AST = col_double(),
## .. HF = col_double(),
## .. AF = col_double(),
## .. HC = col_double(),
## .. AC = col_double(),
## .. HY = col_double(),
## .. AY = col_double(),
## .. HR = col_double(),
## .. AR = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Extract numeric variables only
df_corr <- football_data[, sapply(football_data, is.numeric)]
# Create the correlation matrix using Pearson's method
df_corr.cor <- cor(df_corr, method = "pearson")
# Define color palette for heatmap
palette <- colorRampPalette(c("green", "white", "red"))(20)
# Plot correlation heatmap
heatmap(x = df_corr.cor, col = palette, symm = TRUE)
The correlation matrix helps in understanding the linear relationships between pairs of numeric variables by presenting a matrix of correlation coefficients.
This heatmap visualizes the strength and direction of linear relationships between numeric variables in football_data:
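The palette runs from green through white to red, and heatmap() maps it onto the observed range of coefficients from lowest to highest. Green therefore marks the lowest (most negative) correlations and red the highest; the diagonal, where each variable is compared with itself, is red. White falls in the middle of the observed range, which is not necessarily zero.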
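# Select a subset of columns (shots on target, fouls, corners, cards) to reduce multicollinearity.
# Selecting by name gives the same columns as positions 7:16 and is safe to re-run.
df_corr <- df_corr[, c("HST", "AST", "HF", "AF", "HC", "AC", "HY", "AY", "HR", "AR")]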
# Recompute the correlation matrix after removing multicollinear variables
df_corr.cor <- cor(df_corr, method = "pearson")
# Plot the updated heatmap
heatmap(x = df_corr.cor, col = palette, symm = TRUE)
This step removes variables that are strongly correlated with others. Multicollinearity occurs when independent variables are highly correlated with each other, which can make coefficient estimates unstable and inflate standard errors in regression analysis. The subset above drops the full-time and half-time goal counts and the total shots (FTHG, FTAG, HTHG, HTAG, HS, AS) and keeps shots on target, fouls, corners, and cards, which reduces the risk of multicollinearity-related issues and improves the reliability of regression models.
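One way to identify multicollinear candidates, rather than reading them off the heatmap, is to list every pair of variables whose absolute correlation exceeds a cutoff. The following is a minimal base R sketch; `high_pairs()` is a hypothetical helper, the 0.7 cutoff is an arbitrary assumption, and the small matrix is made up for illustration (in the analysis it would be `cor(df_corr)`).

```r
# Hypothetical helper: list variable pairs with |r| above a cutoff
high_pairs <- function(cor_mat, cutoff = 0.7) {
  # Look only at the upper triangle so each pair is reported once
  idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    row.names = NULL
  )
}

# Illustrative correlation matrix with made-up coefficients
vars <- c("HS", "HST", "HF")
toy_cor <- matrix(
  c(1.0, 0.8, 0.1,
    0.8, 1.0, 0.2,
    0.1, 0.2, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

flagged <- high_pairs(toy_cor, cutoff = 0.7)
flagged
```

For each flagged pair, one of the two variables is a candidate for removal; which one to drop remains a modelling decision.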
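# Drop the full-time goal counts (columns 4 and 5, FTHG and FTAG), which determine the target FTR directly
input_data <- football_data[, !(names(football_data) %in% c("FTHG", "FTAG"))]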
Now we split the data into a 70% training dataset and a 30% test dataset.
# Set seed for reproducibility
set.seed(123)
# Split index creation
index <- createDataPartition(input_data$FTR, p = 0.7, list = FALSE)
train_data <- input_data[index, ]
test_data <- input_data[-index, ]
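# Create the output directories if they do not exist yet, then save both splits
dir.create("./training", showWarnings = FALSE)
dir.create("./testing", showWarnings = FALSE)
write.csv(train_data, "./training/training.csv", row.names = FALSE)
write.csv(test_data, "./testing/test.csv", row.names = FALSE)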
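For a factor outcome such as FTR, createDataPartition() samples within each class so that the training set keeps roughly the same mix of outcomes as the full data. The base R sketch below illustrates that idea; it is not caret's implementation. `stratified_split()` is a hypothetical helper, and the outcome vector is rebuilt from the counts shown by `summary(football_data)` rather than read from the file.

```r
# Hypothetical helper: sample a share p of the row indices within each class of y
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  train <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(train, use.names = FALSE))
}

# Outcome vector rebuilt from the counts reported earlier (2508 / 1320 / 1492)
outcome_levels <- c("Home Win", "Draw", "Away Win")
ftr <- factor(rep(outcome_levels, times = c(2508, 1320, 1492)), levels = outcome_levels)

set.seed(123)
train_idx <- stratified_split(ftr, p = 0.7)

# Class proportions in the full data and in the training part
full_share  <- prop.table(table(ftr))
train_share <- prop.table(table(ftr[train_idx]))
rbind(full = full_share, train = train_share)
```

Comparing the two rows is a quick sanity check that the 70/30 split did not distort the balance between home wins, draws, and away wins.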